The recent “Nationwide academic achievement and study situation survey” was clearly
influenced by the idea of “authentic assessment”, an educational assessment perspective
focused on “quality” and “engagement”. However, when “performance assessment”, the
assessment method corresponding to this focus, is adopted in academic achievement surveys,
it runs the risk of turning into a rigid hollow structure. In this paper I will reflect on the
ideal application of performance assessment in academic achievement surveys, and will
investigate the concepts of “consequential validity”, “equity” and “moderation” in regard
to their potential to further develop the discourse.


1 Stating the Problem 

We can safely say that all the post-war reforms of school education in Japan have centered on
the issue of academic achievement. Two representative illustrations of this focus are the 1950s
debate revolving around the issue of “declining academic achievement”, and the discussion on
“academic achievement on examinations” around 1975. In both of these cases the debate was
triggered and fueled by the results of academic achievement surveys. The Kubo Shun’ichi survey
(1951) and the Public Institute for Educational Research Survey (1975-1976) achieved great fame
in this way and have exerted influence on the revisions of the curriculum guidelines ever since.
Likewise, the latest revision (March 2008) was inspired by the academic achievement debate
around the turn of the 21st century. It was furthermore the most fundamental reform in decades
in that the focus was changed from “New Academic Achievement” to “Solid Academic Achievement
(comprehensive learning abilities)”.

Although the current debate was at first solely concerned with the declining academic
achievement of college students, 1 the later debate, influenced by academic achievement
surveys and especially PISA 2003, has taken a different course. PISA-type “literacy”
assessments are being incorporated into the educational policies of countries around the
world as the universal standard for the new global economy. 2 In Japan, this trend is reflected
in the fact that MEXT, concerned about weakening “reading literacy”, proposed in December 2005
the “Program for the improvement of reading literacy” and the “Teaching resources for the
improvement of reading literacy”. Furthermore, the fact that in the latest revision of the
curriculum guideline the issue of “linguistic achievement” was emphasized, and that the
“Nationwide academic achievement and study situation survey”, carried out as a census survey
since April 2007, distinguishes between A-type problems (concerning basic “knowledge”) and
B-type problems (concerning the “application” of knowledge in everyday situations), 3 are also
clear signs of influence by PISA.

Whereas in the past the academic achievement surveys served only to trigger and fuel the
debate on the subject, the current situation is that they guide the course of educational
reforms. Subjecting the “academic achievement survey” to internal analysis, as a way of
unraveling its politics, 4 has therefore become a matter of high priority. In this paper I will
analyze the academic achievement survey from the perspective of educational assessment theory
as internal analysis 5 and will clarify the essence of the assessment that is currently exerting
strong influence on PISA, namely “authentic assessment” or “performance assessment”, which is
concerned with “quality” and “engagement”. 6 I will furthermore examine how this assessment is
critically refined as it is adopted in academic achievement surveys in the USA. Through the
above analysis and examination I aim to identify the problematic issues concerning the
“Nationwide academic achievement and study situation survey” and to point out the tasks that
lie ahead.


2 Authentic Assessment 

(a) the context of its conception 

In contrast with TIMSS, which measures the extent to which a specific school curriculum is
mastered, PISA-type “literacy” places emphasis on the proficiency of knowledge and skills that
children need in order to live their daily lives. “Mathematical literacy” for example is
described in a functional manner as “concerned with the capacities of students to analyze,
reason, and communicate ideas effectively as they pose, solve, and interpret mathematical
problems in a variety of situations.” 7 Measuring this kind of PISA-type “literacy” therefore
requires the creation of concrete problems that are situated in “authentic” contexts. It is in
this characteristic that we can clearly recognize the influence of the idea of “authentic
assessment”.

The term “authentic assessment” made its first appearance within the context of educational
assessment in the latter half of the 1980s in America through the work of Wiggins. 8 Inspired by
the famous report “A Nation at Risk” (1983), in which the importance of raising academic
achievement was emphasized, this period witnessed the large-scale introduction of
“standardized tests”. These “high-stakes” tests were carried out by state governments in order
to inspect the educational results of schools and school districts, and to respond to demands
for accountability. However, before too long resistance arose against the top-down enforcement
of these tests, and questions were asked as to whether educational results could actually be
measured through standardized tests.

“Standardized tests” more often than not present children with artificial and fragmentary
(one-shot) questions that demand rote memorization of knowledge. Furthermore, they create a
“test atmosphere” that, by exhibiting various kinds of ritualized aspects, is far removed from
the normal class atmosphere. These characteristics caused skepticism about the effectiveness
of standardized tests in assessing a child’s academic achievement. A high grade on a
standardized test might reflect a specific kind of ability that is of value within a school,
but does it actually guarantee that the necessary skills to live and work in society are
mastered? And might not the uniformity of standardized tests serve to amplify racial and
social differences? It is against this background of critique against standardized tests that
the idea of authentic assessment made its first appearance. 9

(b) criticism towards standardized testing 

Despite his critical stance towards standardized tests, Wiggins maintains that even though
there are limits to the “test” as an assessment method, we should not discard the method in its
entirety. 10 When we take the specific criticism towards “standardized tests” to also apply to
“tests” in general, we not only ignore the fact that standardized tests are specifically
“group-based” testing, but we also avoid the important task of critically reconstructing the
meaning embedded in “tests” as such.

Wiggins therefore advocates a distinction between “establishing a standard” in order to
clarify the common level, and “standardization” in order to support “standardized testing”.
In the past, “establishing a standard” was basically equated with “standardization” and meant
setting up an assessment standard (norm) based on the average performance level of a group.
However, the “bell curve”, used in the analytic process of this kind of standardization,
presupposes the existence of well performing children and ill performing children. As soon as
teachers become accustomed to this kind of “relative assessment” they are likely to avoid the
difficult, but fruitful task of establishing criteria that aim to clarify the common academic
achievement level.

Wiggins divides “tests” into “norm-referenced tests” and “criterion-referenced tests” (based
on “objective-referenced assessment”) and maintains that his emphasis on “authentic assessment”
should be understood as a further development of “objective-referenced assessment”. “Further
development” here most likely means an orientation towards the development of an assessment
method that overcomes the limits of tests. However, as I shall explain in further detail later,
when the assessment methods implied by “authentic assessment” are adopted in large-scale
academic surveys, there is a risk that they will, contrary to their original intention,
transform into devices of oppression. There is an increasing awareness of the necessity for
the construction of a theory that will prevent this kind of transformation.

(c) the meaning of “authentic” : focusing on “quality” 

Wiggins has described “authentic assessment” as a method that “replicates or simulates the 
contexts in which adults are ‘tested’ in the workplace, in civic life, and in personal life.” 11 Shaklee 
has referred to it as the assessment of children within a process that involves them with “realistic 
tasks.” 12 

The emphasis here on terms like “life” and “realistic tasks” obviously implies criticism 
towards the uniformity and artificiality of the standardized tests. It is supposed that the ability of 
a child is formed through the involvement with “authentic” tasks, and that it is therefore exactly 
this process that should be the object of assessment. Teaching and assessment are thus not thought 
of as separate, but rather as two continuous aspects of the same process. 

Saying that the assessment task has to be “realistic” within the context of the child’s life
means that the task has to be a familiar and unavoidable part of that child’s life. However, we
should note that “realistic” within the context of authentic assessment implies more than just
familiarity and necessity. Wiggins maintains that “authenticity” in assessment corresponds to
the qualitatively higher-level objectives of “synthesis” and “application” that have been
described by Bloom in his “Taxonomy of Educational Objectives”. 13 “Synthesis problems” for
example are explained by Bloom as problems that have not been treated during class but are
attempted by children by using various materials in a manner like an open-book examination.

Wiggins thus relates “authenticity” in assessment with the qualitatively higher-level
difficulty of the assessment problems posed to children. Problems that reflect the familiar
daily lives of children inspire them to attempt these problems, but are also particularly
difficult in that these daily lives are worlds in which the information necessary to solve
problems is either abundant or scarce, thus demanding the ability to refine and synthesize
knowledge. Awareness of this paradox between “familiarity” and “difficulty” contained within
“authenticity” is of crucial importance to a correct understanding of the idea of “authentic
assessment”.


3 Performance Assessment 

(a) restructuring validity and reliability 

We have established that the essence of the idea of “authentic assessment” lies in its focus
on higher-level “quality”. However, when we want to turn this idea of “authentic assessment”
into a concrete assessment method, we have to address the methodological principles of
“validity” and “reliability”. Whereas validity is a concept that describes the extent to which
the assessment object has been measured accurately, reliability describes the extent to which
the assessment object has been measured consistently. Although closely related, these concepts
have developed in an oppositional manner; where one was emphasized, the other was neglected.
For example, while tests made by teachers themselves were only concerned with validity at the
expense of reliability, high-stakes tests that demanded accountability focused exclusively on
reliability while sacrificing validity. In the case of large-scale academic achievement tests
the demand was for reliability, and this demand was accordingly satisfied by making almost
exclusive use of “objective tests”.

Overcoming this conflict between validity and reliability requires the development of an
assessment method of high validity, together with the establishment of an assessment standard
that is able to secure reliability. This has above all become the task at hand for “authentic
assessment” understood as a further development of “objective-referenced assessment”. The fact
that the recent “Nationwide academic achievement and study situation survey” established B-type
problems signifies a commitment to this new task. I will now examine how the concepts of
“validity” and “reliability” came to be reconstructed and understood in an integrative manner.

The concept of “validity” as an assessment method principle has traditionally been divided 
into “construct validity”, “content validity”, and “criterion-related validity”, with the latter further 
dividing into “concurrent validity” and “predictive validity”. I will first offer a brief explanation 
of these concepts. 14 

“construct validity” describes the extent to which the assessment method is adequately
measuring the theorized constructive concept that is taken as the object of assessment. It is
therefore necessary to accurately define one’s constructive concept beforehand.

“content validity” describes the extent to which the assessment method is accurately
representing or abstracting the object of assessment based on the constructive concept. It is
therefore necessary to identify the important items within the domain in question.

“concurrent validity” describes the extent to which the assessment method is successful
in measuring the constructive concept when compared to another method measuring the same
concept. This presupposes, however, that the other method itself possesses high validity.

“predictive validity” describes the extent to which the assessment method is able to
accurately predict future results. This presupposes, however, that the constructive concept
remains unchanged.

From the perspective of authentic assessment, “construct validity” is the most important
kind of validity because it examines the evidential grounds for the “educational objectives”
being referenced, and describes to what extent these objectives are reflected in the assessment
method. We can think of “content validity” as the concrete function of this “construct
validity”. Furthermore, the criteria of “criterion-related validity”, in the form of
“concurrent validity” and “predictive validity”, have traditionally been discussed on the basis
of the “objectivity” of “relative assessment” (for example intelligence tests). From the
perspective of “objective-referenced assessment” and “authentic assessment”, it is therefore
necessary to examine the significance and meaning of “concurrent validity” and “predictive
validity” when their criteria become educational objectives.

Gipps has further refined “construct validity” into the concept of “curriculum fidelity”, an
idea that demands that the assessment method covers the entire spectrum of the curriculum and
that it is matched with its particular domain and level. 15 In cases where qualitatively
higher-level educational objectives (for example, expressive ability or problem solving
ability) are established, but an objective test corresponding to lower-level educational
objectives is used as the assessment method, two problems arise. Firstly, it will remain
unclear whether the higher-level objectives have been reached or not. And secondly, children
will adapt to the lower-level tests. Gipps’s suggestion of “curriculum fidelity” emphasizes
this problem and urges the development of an assessment method that corresponds to
qualitatively higher-level educational objectives.

“Reliability” is a concept describing the extent to which the assessment results are stable
and coherent, regardless of where, when and by whom the measurement was carried out. In the
past, we see that the “measurement movement” that arose as criticism against “absolute
assessment” focused on the sole pursuit of “reliability” through the use of statistical
methods, and as a result lost all “validity”. How should we then understand the concept of
“reliability” within the context of “authentic assessment”?

“Reliability” has traditionally been divided into the “reliability of the assessment method”
and the “reliability of grading”. 16 The first type of reliability describes the extent to
which the assessment method is stable. Ways of establishing it are, for example, the
“test-retest method”, in which the same test group is measured twice with the same assessment
method, and the “parallel testing method”, in which two measurements are carried out within
the same population by two assessment methods comparable in form and difficulty. The second
type of reliability describes the extent to which the grades are consistent. Examples are the
“inter-rater reliability” that describes the consistency of grades by different raters, and
the “intra-rater reliability” that describes the consistency of grades for one child when
assessed multiple times by the same rater.
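
To make the notion of “inter-rater reliability” more concrete, the following is a minimal sketch in Python of two common ways of quantifying the consistency between two raters. It is offered only as an illustration: the source does not name any particular statistic, so the use of simple percent agreement and of Cohen’s kappa (agreement corrected for chance) is an assumption, as are the function names and the invented scores.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of children to whom both raters gave the same grade."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same, non-empty set of children")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters independently pick the same grade.
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical grade throughout
    return (observed - expected) / (1 - expected)


# Invented grades (1 to 3) given by two hypothetical raters to the same eight children.
scores_a = [3, 2, 3, 1, 2, 3, 1, 2]
scores_b = [3, 2, 2, 1, 2, 3, 1, 3]

agreement = percent_agreement(scores_a, scores_b)
kappa = cohens_kappa(scores_a, scores_b)
```

The same functions could in principle be applied to “intra-rater reliability” by passing two gradings made by one rater at different times, although that reuse is likewise only a suggestion.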

As has been pointed out, an assessment method that pursues “reliability” without regard for
“validity” will result in an “objective test” or “standardized test”. That is why Gipps, in
tandem with “curriculum fidelity” as a new concept of validity, has suggested “comparability”
as a new concept of reliability. 17 Comparability describes the extent to which different
raters share the same understanding of the assessment standard, and are assessing the
performance of the assessment object impartially according to the same rubric. In order to
develop an assessment method that can correspond to “quality”, “authentic assessment” theory
is reconstructing the concepts of validity and reliability to try and understand them in an
integrative manner. It is against this background that the idea of “performance assessment”
was suggested.

(b) assessing “performance” 

Performance within the context of “performance assessment” refers to the external
presentation of an internal state of mind through, for example, gesture, behavior, painting,
or language. “Performance assessment” tries to understand the rich aspects of learning as
expressed through the involvement with “authentic tasks”. By assessing children through tasks
like free essays, live demonstrations, and presentations, “performance assessment” aims to
comprehend the qualitatively higher-level aspects of academic achievement like intelligence,
insight, and expression. 18

In a popular dictionary of American education “performance assessment” is described as a
method measuring “how well students apply knowledge to the real world.” 19 This kind of
description is a common one and we also find it reflected in the B-type problems (e.g. the
too-much-information problem, the free essay) of the recent “Nationwide academic achievement
and study situation survey”. We should note however that free essays that can be carried out
with nothing more than “pen and paper”, even though they might require “application” and
“integration”, are considered by Wiggins to be a “prompt”. Wiggins uses “performance
assessment” in the restricted sense of a method that requires the child to perform a certain
role. 20 This means that depending on the type and accuracy of the assessment method adopted,
there will be a difference in the quality of understanding of the performance. The performance
task emphasized by Wiggins can therefore only be realized through synthesis with actual
teaching practice, and is clearly not suited for large-scale simultaneous academic achievement
tests.

In order to secure the reliability of “performance assessment” we have to develop rubrics.
A rubric consists of “scales”, descriptors that signify assessment standards, and concrete
samples (also referred to as anchors). 21 Rubrics become easier to communicate and to verify
when each scale is linked with a representative sample (for example in the case of “data
collection”, papers written by the child). Because the scales are accompanied by pre-decided
descriptors and samples, they can be considered “ordinal scales” or “interval scales” rather
than “nominal scales”. 22 It is exactly because of this “reliability” that rubrics have been
proposed as a method to guarantee the objectivity of assessment standards.
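
The structure of a rubric as described above (scales, descriptors and anchors) can be sketched as a small data structure. The Python below is only an illustrative assumption: the class names, the descriptors and the anchor labels are all hypothetical and are not taken from any rubric discussed in this paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RubricLevel:
    """One point on the scale, with its descriptor and its anchor samples."""
    score: int
    descriptor: str
    anchors: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Rubric:
    task: str
    levels: Tuple[RubricLevel, ...]

    def level(self, score: int) -> RubricLevel:
        """Look up the level that corresponds to a given grade."""
        for lvl in self.levels:
            if lvl.score == score:
                return lvl
        raise KeyError(f"no level with score {score}")

    def next_level(self, score: int) -> Optional[RubricLevel]:
        """Return the closest higher level, or None at the top of the scale."""
        higher = [lvl for lvl in self.levels if lvl.score > score]
        return min(higher, key=lambda lvl: lvl.score) if higher else None


# Hypothetical three-point rubric for a "data collection" task.
data_collection = Rubric(
    task="data collection",
    levels=(
        RubricLevel(1, "Collects data without a plan", ("sample_paper_1",)),
        RubricLevel(2, "Collects relevant data but records it unevenly", ("sample_paper_2",)),
        RubricLevel(3, "Collects relevant data and records it systematically", ("sample_paper_3",)),
    ),
)

current = data_collection.level(2)
target = data_collection.next_level(2)
```

Because the levels are ordered by score, the sketch treats the scale as ordinal: one can ask which level lies above another, which a purely nominal set of labels would not allow.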

It is important however that the rubric is also made available to children in a readily
understandable form. Although there might be some resistance to making the assessment index
openly available, it serves a clear and significant purpose. Firstly, in the difference between
“open” and “secret” standards lies the difference between “objective-referenced assessment”
and “absolute assessment”. Making the rubric openly available furthermore enables the critical
exploration and correction needed when problems arise during or after assessment.

Even more importantly, an “open” rubric can serve children as a guideline in their study
activities and self-assessment. A grading by use of rubrics is nothing more than an assessment
of children at that particular time, and does not signify any kind of ultimate judgment. When
for example a child receives a grading of 2 for a science experiment, it is important that the
teacher and child share the same understanding of how to improve on the learning process in
order to receive a grading of 3. It is this important process of shared understanding that
rubrics are expected to facilitate. Although for the recent “Nationwide academic achievement
and study situation survey” the equivalent of a rubric was presented in the form of the
“Solution types” within the “Explanation materials” and “Outline of the survey results”, there
is still ample room for improvement when it comes to the establishment of rubrics. It is fair
to say that the “confusion” reported in newspapers brought to light some fundamental problems
concerning the establishment of rubrics and the proficiency of raters. 23


4 Performance Assessment in Academic Achievement Surveys 

(a) high-stakes academic achievement surveys 

In America in the early 1990s the “standards movement” arose. The movement was highly
influenced by the idea of “authentic assessment” and managed to convey its message to PISA,
eventually leading to the adoption of performance assessment in state level academic
achievement tests. Then, as a result of the establishment of the No Child Left Behind Act in
2001, the demand for accountability rose, causing the spread of high-stakes state level tests
that were supposed to respond to this demand. The principle of competition inherent in these
tests led to the reallocation of teachers and sometimes even pushed schools to the brink of
closure. 24

Even though it was admitted that, as a method of educational assessment, “performance
assessment” was superior to “standardized tests”, the above-mentioned situation gave rise to
criticism against its adoption in high-stakes academic achievement surveys. 25 And although the
“standards movement” got caught up in the high-stakes state level tests and was now focusing on
exam training to improve test scores, there were also those who defended the original position
of the “standards movement” as rooted in the protection of the rights of the socially
vulnerable. 26

Despite differences in approach, all criticism sprang from the shared concern that, when
adopted in high-stakes academic achievement surveys, “performance assessment” would end up as
a tool for indoctrination and discrimination no different than “standardized tests”,
constricting the functioning of education and schools. It was feared that adopting rubrics
would create distance from the child’s actual activities. 27 It is interesting to note that
identical concerns were voiced regarding the B-type problems of the recent “Nationwide academic
achievement and study situation survey”.

That “authentic assessment” has always shared these concerns can be clearly seen from the
fact that its position demanding “engagement” in educational assessment also includes
grass-roots criticism against the “standards movement” and the No Child Left Behind Act that,
by prioritizing business profit, both promote a new kind of discrimination. 28 It is against
this background of “engagement”, and guided by the methodological principles for academic
achievement surveys as implied by Gipps, that I would like to reflect on the potential of
“performance assessment”.

(b) methodological principles for academic achievement surveys 

Besides reconstructing the concepts of validity and reliability, Gipps also suggested some
concepts that can be considered as methodological principles for academic achievement surveys.
One of these is the concept of “consequential validity”, which describes the kind of
consequences that result from the practice of a certain assessment method. 29 When for example
a higher-level assessment method with high “curriculum fidelity” is put into practice, there
is a risk that teachers will come to emphasize exam training at the expense of reflection upon
the quality of lesson study and the formation of higher-level academic achievement. The concept
of “consequential validity” serves to indicate this kind of risk.

Overcoming this problem however will require a reform of educational practice that fully
realizes the limits of “performance assessment” adopted in “pen and paper”-style academic
achievement surveys, and that will train children in daily life performances. This will demand
a teaching strategy that is aware of the structure of performance, and that is based on the
multifarious requirements of classroom and child. 30

Then there is the principle of “equity”, which intervenes in the process of test creation and
grading, and tries to provide equal conditions for the examinees by taking into account
various cultural biases relating to sex, nationality, race, ethnicity and class. 31 Depending
on the materials used in the test problems of international academic achievement surveys, some
cultural regions are clearly at a disadvantage. For example, to a child living in a country
without a train network, test problems that include materials on trains pose obvious
difficulties. In these cases, the aim should be to use materials relating to the public
transportation network popular in that child’s own country. We see that equity is based on the
same emphasis on realistic context we also find in “authentic assessment”, and that it can
help us avoid the pitfalls of academic achievement surveys.

Lastly, there remains to be discussed the principle of “moderation”, which runs through not
only the issue of academic achievement but through educational assessment in its entirety. 32
Although the principle was originally proposed as a method to heighten “comparability”, it is
also related to the emphasis on the expertise of teachers and the establishment of democracy
within the context of educational assessment. Moderation can be divided into the method of
unifying the assessment process, and the method of unifying the assessment results. Often used
in “performance assessment”, the principle of group moderation is a method that unifies the
assessment results in order to create a rubric, and it is valued as an effective method to
heighten the proficiency of the teacher in educational assessment. 33 Because it creates a
rubric in a bottom-up fashion, building up assessment criteria in partnership as it were, it is
also a strategy that helps to avoid fixation of assessment criteria. In a recent work, Wiggins
suggested “curriculum management”, a strategy for constructing schools that would realize
“authentic assessment” and would increase the expertise of teachers. 34 Because this implies
the construction of moderation at school level, we should understand “curriculum management”
as yet another strategy to defend against the fixation of educational assessment.

As we have seen, “authentic assessment” consists of strategies and principles that, grounded
in the awareness of “engagement”, developed in opposition to the large-scale academic
achievement surveys. These strategies and principles can become effective methodological
principles in the analysis of not only the “Nationwide academic achievement and study situation
survey”, but also the academic achievement surveys at school and district level. We especially
have to pay attention to how the principle of moderation, which reconstructs the politics of
academic achievement surveys from within, is going to be utilized. While learning from American
and European experience, Japan too is now venturing into the territory of research and practice
of academic achievement surveys.

Note 

1. This article was originally published in Japanese in The Japanese Journal of Educational Research,
Vol. 75, No. 2, 2008.